Introduction to Triton Programming: The Parallel Execution Model: Thinking in Blocks

Moving from serial CPU programming to GPU programming requires a paradigm shift: from element-by-element iteration to block-based execution. We no longer treat data as a stream of scalars but as a collection of "blocks" that are scheduled to saturate the hardware's bandwidth.

1. Memory-Bound vs. Compute-Bound

Whether a kernel is memory-bound or compute-bound is determined by its ratio of arithmetic operations to memory accesses. Vector addition is typically memory-bound because it performs only one addition for every three memory operations (2 loads, 1 store). The hardware spends more time waiting on DRAM than computing.
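The ratio described above can be worked out by hand. The sketch below assumes float32 elements (4 bytes each) and a hypothetical device with a peak of 10 TFLOP/s and 1 TB/s of memory bandwidth; those two device figures are illustrative assumptions, not measurements of any real GPU.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Return FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


BYTES_PER_FLOAT32 = 4

# Per output element, vector addition does 1 add and 3 memory operations
# (load x, load y, store out).
flops_per_element = 1
bytes_per_element = 3 * BYTES_PER_FLOAT32

vector_add_intensity = arithmetic_intensity(flops_per_element, bytes_per_element)

# Hypothetical device figures, chosen only for illustration.
PEAK_FLOPS = 10e12       # FLOP per second (assumed)
PEAK_BANDWIDTH = 1e12    # bytes per second (assumed)

# FLOPs the device can perform in the time it takes to move one byte.
machine_balance = PEAK_FLOPS / PEAK_BANDWIDTH

# A kernel whose intensity is below the machine balance cannot keep the
# arithmetic units busy: it is limited by memory bandwidth.
is_memory_bound = vector_add_intensity < machine_balance

print(f"vector add intensity: {vector_add_intensity:.4f} FLOP/byte")
print(f"machine balance:      {machine_balance:.1f} FLOP/byte")
print(f"memory-bound:         {is_memory_bound}")
```

Under these assumptions the kernel supplies roughly 0.08 FLOP per byte while the device could consume 10, so the arithmetic units sit idle most of the time.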

2. The Role of BLOCK_SIZE

BLOCK_SIZE sets the granularity of parallelism. If it is too small, we underutilize the GPU's wide execution lanes. An optimal size ensures there is enough "work in flight" to saturate the memory bus.
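To make the granularity concrete, here is a minimal pure-Python emulation of how a block-based vector addition might divide its work. It is a sketch of the idea only, not Triton code: the helper names cdiv and blocked_vector_add are hypothetical, and the outer loop stands in for program instances that a GPU would run in parallel.

```python
def cdiv(n: int, block_size: int) -> int:
    """Ceiling division: how many blocks are needed to cover n elements."""
    return (n + block_size - 1) // block_size


def blocked_vector_add(x, y, block_size):
    """Emulate block-based vector addition, one block per program instance."""
    n = len(x)
    out = [0.0] * n
    # On a GPU, each value of pid would be a separate program instance
    # running in parallel; here they run one after another.
    for pid in range(cdiv(n, block_size)):
        block_start = pid * block_size
        offsets = range(block_start, block_start + block_size)
        for i in offsets:
            # Mask: the last block may extend past the end of the data.
            if i < n:
                out[i] = x[i] + y[i]
    return out


x = [float(v) for v in range(10)]
y = [2.0 * v for v in x]
result = blocked_vector_add(x, y, block_size=4)
print(cdiv(len(x), 4), result)
```

With 10 elements and a block size of 4, three blocks are launched and the third is only half full, which is why the bounds check is needed.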

3. Hiding Latency Through Occupancy

Occupancy is the number of active blocks on the GPU. Although it is not a goal in itself, it lets the scheduler switch in another block to do computation while other blocks wait on high-latency memory fetches from VRAM.
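A rough way to see why many active blocks are needed is Little's law: the data that must be in flight equals bandwidth multiplied by latency. The sketch below uses assumed, illustrative figures (1000 bytes per nanosecond of bandwidth, 500 ns of memory latency, BLOCK_SIZE of 1024 float32 elements) and the simplifying assumption that each block keeps both of its input loads outstanding at once.

```python
# All device figures below are hypothetical and chosen for illustration.
BANDWIDTH_BYTES_PER_NS = 1000   # 1 TB/s expressed as bytes per nanosecond
MEMORY_LATENCY_NS = 500         # assumed round-trip latency to VRAM

BLOCK_SIZE = 1024               # elements per block (assumed)
BYTES_PER_FLOAT32 = 4
LOADS_PER_BLOCK = 2             # vector add loads x and y

# Little's law: concurrency = throughput * latency.
bytes_in_flight_needed = BANDWIDTH_BYTES_PER_NS * MEMORY_LATENCY_NS

# Simplification: each active block has both of its loads outstanding.
bytes_in_flight_per_block = LOADS_PER_BLOCK * BLOCK_SIZE * BYTES_PER_FLOAT32

# Ceiling division: number of active blocks needed to cover the latency.
blocks_needed = -(-bytes_in_flight_needed // bytes_in_flight_per_block)

print(f"bytes in flight needed: {bytes_in_flight_needed}")
print(f"bytes per block:        {bytes_in_flight_per_block}")
print(f"active blocks needed:   {blocks_needed}")
```

With fewer active blocks than this estimate, the memory bus would idle while the outstanding requests complete; with more, the extra blocks give the scheduler something to run during the wait.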

4. Hardware Utilization

To maximize performance, we must match BLOCK_SIZE to the GPU architecture's memory coalescing rules, ensuring that consecutive threads access consecutive memory addresses.
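The effect of consecutive versus strided access can be illustrated by counting how many memory segments a group of threads touches. The sketch below assumes groups of 32 threads, 4-byte elements, and 128-byte memory segments; these values are typical of some GPUs but are assumptions here, and segments_touched is a hypothetical helper.

```python
def segments_touched(stride, num_threads=32, element_bytes=4, segment_bytes=128):
    """Count distinct memory segments accessed when thread t reads
    element t * stride, starting from an aligned base address of 0."""
    addresses = [t * stride * element_bytes for t in range(num_threads)]
    return len({addr // segment_bytes for addr in addresses})


# Consecutive threads read consecutive elements: one segment serves all.
print("stride 1: ", segments_touched(stride=1))

# Every other element: the same data request now spans two segments.
print("stride 2: ", segments_touched(stride=2))

# Each thread lands in its own segment: one transaction per thread.
print("stride 32:", segments_touched(stride=32))
```

Under these assumptions, a stride of 1 needs a single segment for all 32 threads, while a stride of 32 needs 32 segments to deliver the same amount of useful data.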

QUESTION 1

For a kernel that adds two vectors ($out = x + y$), what is the most likely bottleneck on modern GPUs?

Arithmetic Throughput

Memory Bandwidth

Shared Memory Latency

QUESTION 2

What is the primary purpose of 'Occupancy' in the GPU execution model?

To ensure every thread runs as fast as possible.

To hide memory latency by keeping work in flight.

To increase the clock speed of the compute units.

To reduce the power consumption of the HBM.

QUESTION 3

Which of the following describes 'Memory-Bound' behavior?

The GPU is waiting for the memory bus to deliver data.

The GPU has exhausted its available VRAM.

The kernel is performing too many complex floating-point operations.

The CPU cannot launch kernels fast enough.

QUESTION 4

What happens if the BLOCK_SIZE is set too small?

The kernel will fail with a memory error.

The GPU fails to utilize its wide SIMD execution lanes.

The memory bandwidth increases significantly.

QUESTION 5

In the logistics warehouse analogy, what represents the 'Blocks'?

The individual items.

The workers.

The organized pallets.

The delivery trucks.